Skip to content

fix(cudf): Fix UCX output buffer lifetime after async cuDF concat#17533

Open
kjmph wants to merge 7 commits into
facebookincubator:mainfrom
kjmph:fix/experimental-cudf/use-after-free
Open

fix(cudf): Fix UCX output buffer lifetime after async cuDF concat#17533
kjmph wants to merge 7 commits into
facebookincubator:mainfrom
kjmph:fix/experimental-cudf/use-after-free

Conversation

@kjmph
Copy link
Copy Markdown

@kjmph kjmph commented May 15, 2026

UcxPartitionedOutput::flushPending() synchronizes input streams before launching cudf::concatenate(), but concatenate itself is asynchronous on the output stream. When multiple pending inputs were flushed, the operator could clear pendingInputs_ immediately after launching concat, allowing the input CudfVector buffers to be deallocated before the concat kernels finished reading them.

This caused nondeterministic GPU result corruption. A TPC-H Q18 correctness loop against a DuckDB reference reproduced the issue regularly, and compute-sanitizer reported a use-after-free with this stack:

cudf::concatenate
UcxPartitionedOutput::flushPending

Fix the lifetime ordering without a CPU-blocking synchronize:

  • make the concat/output stream wait for all input streams before reading input table views;
  • prefer rebinding owned cuDF input buffers to the output stream so their stream-ordered deallocation happens after concat work on that stream;
  • fall back to a single event recorded on the output stream and waited by all input streams when inputs cannot be rebound.

This keeps the existing memory behavior of releasing pending input buffers before partitioning, while making their async lifetime ordering explicit.

Validated by:

  • TPC-H Q18 GPU correctness loop against stored DuckDB reference
  • compute-sanitizer with stream-ordered race tracking, confirming the original concat use-after-free is no longer reported

@netlify
Copy link
Copy Markdown

netlify Bot commented May 15, 2026

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 4fb8b97
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/6a231cde6038ea00081f9c81

@meta-cla
Copy link
Copy Markdown

meta-cla Bot commented May 15, 2026

Hi @kjmph!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

Comment thread velox/experimental/cudf/vector/CudfVector.cpp Outdated
Comment thread velox/experimental/cudf/exec/Utilities.cpp Outdated
Comment thread velox/experimental/cudf/vector/CudfVector.cpp
@mbasmanova mbasmanova changed the title Fix UCX output buffer lifetime after async cuDF concat fix(cudf): Fix UCX output buffer lifetime after async cuDF concat May 18, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 18, 2026

Build Impact Analysis

Full build recommended. Files outside the dependency graph changed:

  • velox/experimental/cudf/exec/Utilities.cpp
  • velox/experimental/cudf/exec/Utilities.h
  • velox/experimental/cudf/tests/CMakeLists.txt
  • velox/experimental/cudf/tests/CudfVectorTest.cpp
  • velox/experimental/cudf/vector/CudfVector.cpp
  • velox/experimental/cudf/vector/CudfVector.h
  • velox/experimental/ucx-exchange/UcxPartitionedOutput.cpp

These directories are not fully covered by the dependency graph. A full build is the safest option.

cmake --build _build/release

Slow path • Graph generated from PR branch

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 18, 2026

CI Failure Analysis

Auto-generated by the CI Failure Analysis workflow. This comment is updated in place each time CI fails on a new commit, so it always reflects the latest run — re-pushing or re-running CI will refresh the analysis below. Last updated 2026-06-05 21:34:35 UTC from workflow run 27034362865.

🔴 Linux release with adapters — BUILD Failure View logs

Build errors:

The build fails compiling the new test file CudfVectorTest.cpp because cudaError (a CUDA enum) is not formattable by fmt. The VELOX_CHECK_EQ macro at line 46 tries to format two cudaError values for the error message, but no fmt::formatter<cudaError> specialization exists.

FAILED: velox/experimental/cudf/tests/CMakeFiles/velox_cudf_vector_test.dir/CudfVectorTest.cpp.o

velox/experimental/cudf/tests/CudfVectorTest.cpp:46:5:   required from here
  269 |     auto message = ::facebook::velox::errorMessage(__VA_ARGS__);
      |                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~

/usr/local/include/fmt/base.h:2262:45: error: 'fmt::v11::detail::type_is_unformattable_for<cudaError, char> _'
                                         has incomplete type
 2262 |     type_is_unformattable_for<T, char_type> _;
      |                                             ^

The offending line in the new test file (CudfVectorTest.cpp:46) is:

VELOX_CHECK_EQ(
    cudaStreamCreateWithFlags(&stream_, cudaStreamNonBlocking),
    cudaSuccess);

VELOX_CHECK_EQ uses fmt::format to render both arguments when the check fails. Since cudaError (the return type of cudaStreamCreateWithFlags) has no fmt formatter, compilation fails with -Werror.


Correlation with PR changes:

  • This failure is directly caused by this PR. The file velox/experimental/cudf/tests/CudfVectorTest.cpp is a new file added in this PR. The TestCudaStream constructor at line 46 uses VELOX_CHECK_EQ with cudaError values, which triggers the fmt unformattable type error.

Known issues:

  • No existing open issues track this specific failure.
  • Recent main branch CI runs (Linux Build using GCC) all pass, confirming this is not a pre-existing issue.

Reproduce locally:

make release  # will fail during compilation of CudfVectorTest.cpp

Recommended fix:

Replace VELOX_CHECK_EQ with CUDF_CUDA_TRY (which is already used elsewhere in the same file), or cast the cudaError values to int so fmt can format them. The simplest fix is to use the same pattern already present in the helper function makeTable():

// Instead of:
VELOX_CHECK_EQ(
    cudaStreamCreateWithFlags(&stream_, cudaStreamNonBlocking),
    cudaSuccess);

// Use:
CUDF_CUDA_TRY(cudaStreamCreateWithFlags(&stream_, cudaStreamNonBlocking));

This is consistent with the CUDF_CUDA_TRY usage on line 140 of the same file and avoids the fmt formatter issue entirely.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 27, 2026
@meta-cla
Copy link
Copy Markdown

meta-cla Bot commented May 27, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

Comment thread velox/experimental/cudf/vector/CudfVector.cpp
Comment thread velox/experimental/cudf/exec/Utilities.cpp Outdated
@kjmph kjmph force-pushed the fix/experimental-cudf/use-after-free branch from 832f70e to a844318 Compare June 4, 2026 14:37
kjmph added 2 commits June 5, 2026 11:32
UcxPartitionedOutput::flushPending() synchronizes input streams before
launching cudf::concatenate(), but concatenate itself is asynchronous
on the output stream. When multiple pending inputs were flushed, the
operator could clear pendingInputs_ immediately after launching
concat, allowing the input CudfVector buffers to be deallocated before
the concat kernels finished reading them.

This caused nondeterministic GPU result corruption. A TPC-H Q18
correctness loop against a DuckDB reference reproduced the issue
regularly, and compute-sanitizer reported a use-after-free with this
stack:

  cudf::concatenate
  UcxPartitionedOutput::flushPending

Fix the lifetime ordering without a CPU-blocking synchronize:

* make the concat/output stream wait for all input streams before
  reading input table views;
* prefer rebinding owned cuDF input buffers to the output stream so
  their stream-ordered deallocation happens after concat work on that
  stream;
* fall back to a single event recorded on the output stream and waited
  by all input streams when inputs cannot be rebound.

This keeps the existing memory behavior of releasing pending input
buffers before partitioning, while making their async lifetime
ordering explicit.

Validated by:
* TPC-H Q18 GPU correctness loop against stored DuckDB reference
* compute-sanitizer with stream-ordered race tracking, confirming the
  original concat use-after-free is no longer reported
Update the cuDF vector deallocation ordering helpers to accept
std::span instead of concrete std::vector references. These helpers
only need read-only views over CudfVector pointers and CUDA streams.
@kjmph kjmph force-pushed the fix/experimental-cudf/use-after-free branch from a844318 to a891a35 Compare June 5, 2026 15:33
kjmph added 2 commits June 5, 2026 11:35
CudfVector::rebindStream returned early when the vector's logical
stream already matched the target stream. That misses packed_table
storage, where gpu_data may still have a different deallocation stream
after an intra-node UCX handoff.

Update packed_table storage by setting the packed buffer deallocation
stream directly. For owned cudf::table storage, rely on the current
cuDF rebind_stream API and remove the old compatibility shim.

Add CudfVector tests that verify both owned table and packed table
buffers are deallocated on the rebound stream.
When CudfVector buffers cannot be rebound to the output stream,
orderCudfVectorDeallocationsAfterStream falls back to recording an
event on the output stream and making the input streams wait on it.

Reuse a per-thread, per-device CUDA event for this fallback instead of
creating and destroying a new event on each call. Intentionally leak
the event to avoid CUDA calls from thread-local destructors after CUDA
context teardown, matching libcudf's stream synchronization pattern.
Comment thread velox/experimental/cudf/exec/Utilities.cpp
Comment thread velox/experimental/cudf/vector/CudfVector.h Outdated
Comment thread velox/experimental/cudf/tests/CudfVectorTest.cpp Outdated
Comment thread velox/experimental/cudf/tests/CudfVectorTest.cpp Outdated
kjmph and others added 3 commits June 5, 2026 14:45
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Use cuda_async_memory_resource in the CudfVector rebind tests so the
instrumented resource records stream-ordered deallocation behavior.

cuda_memory_resource ignores the stream and frees via cudaFree, so it
only verified the wrapper-observed stream. The async resource makes
the test match the lifetime ordering that CudfVector::rebindStream is
intended to control.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants